On Development of Consistently Punctuated Speech Corpora
نویسندگان
چکیده
Punctuation of automatically recognized speech is important to enhance readability of transcripts and to aid downstream NLP processing. This paper is concerned with issues involved in developing training and test corpora for automatic punctuation systems. Punctuation annotation in speech transcripts is difficult since there are numerous cases for which no standard punctuation rules exist. Special punctuation annotation guidelines tailored to spoken language were developed. Using these guidelines, almost 100 hours of broadcast news and conversation data in English and French have been punctuated by trained annotators. Measures of inter-annotator agreement are provided for both languages and differences between languages and genre are analyzed and discussed, along with some of the most frequent disagreements between annotators. Overall, using the guidelines, the annotation consistency has been significantly improved.
منابع مشابه
Do speech recognizers prefer
In this contribution we examine large speech corpora of prepared broadcast and spontaneous telephone speech in American English and in French. Starting with the question whether ASR systems behave differently on male and female speech, we then try to find evidence on acoustic-phonetic, lexical and idiomatic levels to explain the observed differences. Recognition results have been analysed on 3-...
متن کاملDevelopment of annotated Bangla speech corpora
This paper describes the development procedure of three different Bangla read speech corpora which can be used for phonetic research and developing speech applications. Several criteria were maintained in the corpora development process that includes considering the phonetic and prosodic features during text selection. On the other hand, a specification was maintained in the recording phase as ...
متن کاملSpoken language resources for Cantonese speech processing
This paper describes the development of CU Corpora, a series of large-scale speech corpora for Cantonese. Cantonese is the most commonly spoken Chinese dialect in Southern China and Hong Kong. CU Corpora are the first of their kind and intended to serve as an important infrastructure for the advancement of speech recognition and synthesis technologies for this widely used Chinese dialect. They ...
متن کاملThe Effect of Colligational Corpus-based Instruction on Enhancing the Pragmalinguistic Knowledge of Request Speech Act among Iranian Intermediate EFL Learners
This study investigated the effectiveness of colligational corpus-based instruction on enhancing the pragmalinguistic knowledge of speech act of request among Iranian intermediate EFL learners. The objective of the study was to find out whether or not providing students with corpora through using colligational instruction had any significant effects on enhancing their pragmalinguistic knowledge...
متن کاملRecent Advances of Speech Databases Development Activity for Indian Languages
Development of Speech Corpora and acoustic–phonetic data bases are indispensable for any research and development work in spoken language systems. Systematic efforts have been made to create speech databases for some major languages of India. The paper attempts to present the status and the recent advancements made in corpora development for some of the Indian languages. Different types of data...
متن کامل